Improve ConfigNode leader warm-up before serving#17821

Merged: CRZbulabula merged 11 commits into master from improve-confignode-leader-confirm on Jun 10, 2026
@CRZbulabula

Contributor

Summary

  • Gate ConfigNode leader confirmation on LoadCache warm-up after consensus leader-ready.
  • Track first heartbeat coverage for Nodes, Regions, RegionGroups, and ConsensusGroups before serving requests.
  • Return CONFIG_NODE_LEADER_WARMING_UP during warm-up so DataNodes wait and retry the current ConfigNode instead of treating it as redirection.
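
To illustrate the third point, here is a minimal sketch of how a caller might tell "wait and retry the same ConfigNode" apart from "follow a redirection". The `Status` and `Action` enums and the `classify` method are hypothetical stand-ins, not the actual `TSStatusCode` or `ConfigNodeClient` code.

```java
/** Hypothetical sketch; not the actual ConfigNodeClient logic. */
final class WarmUpStatusSketch {
  /** Illustrative stand-ins for the relevant TSStatusCode entries. */
  enum Status { SUCCESS, REDIRECTION_RECOMMEND, CONFIG_NODE_LEADER_WARMING_UP, OTHER }

  /** What the caller does next (names are assumptions made for this sketch). */
  enum Action { DONE, SWITCH_TO_SUGGESTED_LEADER, WAIT_AND_RETRY_SAME_NODE, FAIL }

  static Action classify(Status status) {
    switch (status) {
      case SUCCESS:
        return Action.DONE;
      case REDIRECTION_RECOMMEND:
        // The contacted node is not the leader: move to the node it suggests.
        return Action.SWITCH_TO_SUGGESTED_LEADER;
      case CONFIG_NODE_LEADER_WARMING_UP:
        // The contacted node is the leader but not ready yet: stay on it and retry later.
        return Action.WAIT_AND_RETRY_SAME_NODE;
      default:
        return Action.FAIL;
    }
  }
}
```

The point of a dedicated status is that the warm-up case keeps the current target instead of triggering a leader switch.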

Tests

  • mvn spotless:apply -pl iotdb-core/confignode,iotdb-core/datanode,iotdb-client/service-rpc
  • mvn compile -pl iotdb-client/service-rpc,iotdb-core/confignode
  • mvn test -pl iotdb-core/confignode -Dtest=LoadManagerTest
  • mvn compile -pl iotdb-client/service-rpc,iotdb-core/datanode (fails in unrelated existing sources: ArrayDeviceTimeIndex.java and TableDeviceSchemaCache.java still pass IDeviceID to PartialPath.matchFullPath)

@codecov

codecov Bot commented Jun 2, 2026


Codecov Report

❌ Patch coverage is 21.62162% with 232 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.69%. Comparing base (c3e74a2) to head (df8ce33).
⚠️ Report is 1 commit behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...nsensus/statemachine/ConfigRegionStateMachine.java | 7.31% | 114 Missing ⚠️ |
| ...c/handlers/heartbeat/DataNodeHeartbeatHandler.java | 0.00% | 41 Missing ⚠️ |
| ...che/iotdb/db/protocol/client/ConfigNodeClient.java | 0.00% | 27 Missing ⚠️ |
| ...confignode/manager/consensus/ConsensusManager.java | 3.70% | 26 Missing ⚠️ |
| ...che/iotdb/confignode/manager/load/LoadManager.java | 74.35% | 10 Missing ⚠️ |
| ...che/iotdb/confignode/manager/ProcedureManager.java | 0.00% | 6 Missing ⚠️ |
| ...rg/apache/iotdb/confignode/service/ConfigNode.java | 0.00% | 3 Missing ⚠️ |
| ...apache/iotdb/confignode/manager/ConfigManager.java | 0.00% | 2 Missing ⚠️ |
| ...iotdb/confignode/manager/load/cache/LoadCache.java | 88.23% | 2 Missing ⚠️ |
| .../iotdb/confignode/procedure/ProcedureExecutor.java | 83.33% | 1 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17821      +/-   ##
============================================
+ Coverage     40.54%   40.69%   +0.14%     
+ Complexity     2622     2621       -1     
============================================
  Files          5244     5244              
  Lines        362367   362567     +200     
  Branches      46651    46678      +27     
============================================
+ Hits         146938   147552     +614     
+ Misses       215429   215015     -414     

@Caideyipi Caideyipi left a comment

Collaborator

I reviewed the warm-up changes on a57680d2542. I think there are a few issues that should be fixed before merge:

  1. AINode treats the new warm-up status as a hard failure. ConfigManager.registerAINode() now returns CONFIG_NODE_LEADER_WARMING_UP while confirmLeader() is warming up, but the Python AINode client only treats REDIRECTION_RECOMMEND as retryable in _update_config_node_leader(). node_register() / node_restart() then call verify_success() and raise on status 1014, so an AINode can fail startup if it hits the leader during warm-up. Please add the new code to the AINode constants and retry handling paths.

  2. Non-seed ConfigNode registration has the same gap. registerConfigNode() can now return CONFIG_NODE_LEADER_WARMING_UP, but ConfigNode.sendRegisterConfigNodeRequest() only retries success/redirection/internal-retry statuses and throws StartupException for anything else. A ConfigNode joining during leader warm-up can fail immediately instead of waiting and retrying.

  3. The async leader-service startup has a stepdown race. notifyLeaderReady() now submits startLeaderServicesAfterLoadReady() asynchronously. That task checks isLeaderReady() only once before starting leader-only services and setting leaderServicesReady=true. If notifyNotLeader() runs after that check but before/during service startup, the old task can re-enable services after cleanup. Please guard this with a leader epoch/cancellation token, and re-check before setting leaderServicesReady.

  4. The DataNode register retry budget is too tight for the 30s warm-up tolerance. On CONFIG_NODE_LEADER_WARMING_UP, updateConfigNodeLeader() sleeps 2s and returns retryable, while registerDataNode() has 15 attempts. The final request can still happen before the 30s tolerance expires, then sleep and exit without one post-tolerance attempt. A deadline-based retry or a larger retry budget would avoid this edge case.
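
The arithmetic behind point 4: with 15 attempts and a 2s sleep after each failed attempt, requests go out at roughly 0s, 2s, ..., 28s, so (ignoring request latency) none is sent once the 30s tolerance has expired. A minimal sketch contrasting that fixed budget with a deadline-based loop follows. The class, the method names, and the injected clock and sleeper are hypothetical and exist only so the timing can be simulated; this is not the actual `ConfigNodeClient` code.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

/** Hypothetical sketch of two retry policies; time is injected so it can be simulated. */
final class DeadlineRetrySketch {

  /** Fixed budget: try, sleep, repeat. The last sleep is followed by no further attempt. */
  static boolean retryFixedAttempts(
      BooleanSupplier attempt, int maxAttempts, long sleepMs, LongConsumer sleeper) {
    for (int i = 0; i < maxAttempts; i++) {
      if (attempt.getAsBoolean()) {
        return true;
      }
      sleeper.accept(sleepMs);
    }
    return false;
  }

  /** Deadline based: keeps trying until an attempt made at or after the deadline has failed. */
  static boolean retryUntilDeadline(
      BooleanSupplier attempt,
      long deadlineMs,
      long sleepMs,
      LongSupplier clockMs,
      LongConsumer sleeper) {
    final long deadline = clockMs.getAsLong() + deadlineMs;
    while (true) {
      if (attempt.getAsBoolean()) {
        return true;
      }
      if (clockMs.getAsLong() >= deadline) {
        // The attempt that just failed was issued at or after the deadline, so give up.
        return false;
      }
      sleeper.accept(sleepMs);
    }
  }
}
```

With a simulated leader that becomes ready at 30s, the fixed loop (15 attempts, 2s sleep) gives up without success, while a 60s deadline succeeds on the attempt at 30s.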

Copilot AI left a comment

Contributor

Pull request overview

This PR improves ConfigNode leader “warm-up” semantics so DataNodes avoid premature redirection during leader transitions, and ConfigNode serving is gated on initial heartbeat sampling readiness.

Changes:

  • Add a dedicated CONFIG_NODE_LEADER_WARMING_UP status and have DataNodes wait/retry the current leader during warm-up.
  • Introduce LoadManager.isLoadReady() and a 30s tolerance window to require first heartbeat coverage (ConfigNode/DataNode) before considering load services ready.
  • Track consensus-group heartbeat sampling coverage and add tests for warm-up readiness behavior.
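
A rough sketch of the readiness rule described in the second bullet, under the assumption that the 30s tolerance means the leader stops waiting for missing heartbeats once the window has elapsed. The class, its fields, and the integer node IDs are hypothetical and do not mirror the real `LoadManager` or `LoadCache`.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of "first heartbeat coverage, or tolerance elapsed". */
final class LoadReadinessSketch {
  static final long TOLERANCE_MS = 30_000L;

  private final Set<Integer> expectedNodeIds;
  private final Set<Integer> sampledNodeIds = new HashSet<>();
  private final long warmUpStartMs;

  LoadReadinessSketch(Set<Integer> expectedNodeIds, long warmUpStartMs) {
    this.expectedNodeIds = new HashSet<>(expectedNodeIds);
    this.warmUpStartMs = warmUpStartMs;
  }

  /** Records the first heartbeat sample of a node; unknown nodes are ignored. */
  void recordHeartbeat(int nodeId) {
    if (expectedNodeIds.contains(nodeId)) {
      sampledNodeIds.add(nodeId);
    }
  }

  /** Ready when every expected node has reported, or when the tolerance window has passed. */
  boolean isLoadReady(long nowMs) {
    boolean fullyCovered = sampledNodeIds.containsAll(expectedNodeIds);
    boolean toleranceElapsed = nowMs - warmUpStartMs >= TOLERANCE_MS;
    return fullyCovered || toleranceElapsed;
  }
}
```

The tolerance branch keeps a single unreachable node from blocking the new leader indefinitely.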

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Summary per file:

| File | Description |
| --- | --- |
| iotdb-core/datanode/src/main/java/org/apache/iotdb/db/protocol/client/ConfigNodeClient.java | Treat CONFIG_NODE_LEADER_WARMING_UP as a wait-and-retry instead of redirection. |
| iotdb-core/confignode/src/test/java/org/apache/iotdb/confignode/manager/load/LoadManagerTest.java | Add tests validating load warm-up readiness criteria and 30s tolerance behavior. |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/manager/load/LoadManager.java | Add load readiness state machine (isLoadReady, reason strings, tolerance window). |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/manager/load/cache/LoadCache.java | Track consensus-group sampled nodes; add node-heartbeat unready reasons; cache “unreported” samples. |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/manager/load/cache/consensus/ConsensusGroupCache.java | Always update consensus stats from last sample (including “unready leader”). |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/manager/load/cache/AbstractLoadCache.java | Add hasHeartbeatSample() helper. |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/manager/consensus/ConsensusManager.java | Gate leader confirmation on consensus-ready + leader-services-ready + load-ready; return warming-up status. |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/consensus/statemachine/ConfigRegionStateMachine.java | Track leaderServicesReady and start load services before leader services. |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/client/async/handlers/heartbeat/DataNodeHeartbeatHandler.java | Improve null-safety and cache consensus/region samples; add missing-region sampling on partial reports. |
| iotdb-core/confignode/src/main/java/org/apache/iotdb/confignode/client/async/handlers/heartbeat/ConfigNodeHeartbeatHandler.java | On error, force-update node cache to Unknown (no connection-broken check). |
| iotdb-client/service-rpc/src/main/java/org/apache/iotdb/rpc/TSStatusCode.java | Add CONFIG_NODE_LEADER_WARMING_UP(1014). |

@CRZbulabula

Contributor Author

@Caideyipi Thanks for the detailed review. Fixed in the latest commits, especially f54b484.

  1. AINode registration now treats CONFIG_NODE_LEADER_WARMING_UP as retryable instead of a hard failure.
  2. Non-seed ConfigNode registration now waits and retries when the leader returns CONFIG_NODE_LEADER_WARMING_UP.
  3. ConfigRegionStateMachine now uses a leader-services epoch guard, serializes startup and cleanup with leaderServicesLock, and re-checks the epoch before marking leader services ready.
  4. DataNode registration now uses a 60s warm-up retry deadline, so at least one request is sent after the 30s first-heartbeat tolerance expires.

I also addressed the follow-up warm-up sampling concerns: removed the unreported DataNode Region heartbeat chain, removed the extra DataNodeHeartbeatHandler region-group argument, and restricted consensus sampling so that leader samples are cached only when the DataNode reports leader=true with a consensus logical timestamp.

@Caideyipi Caideyipi left a comment

Collaborator

I still see one remaining stepdown race that I think should be fixed before merge.

ConfigRegionStateMachine.submitIfLeaderServicesEpochCurrent() checks the epoch only before invoking task.run(). If an async task passes that check, then notifyNotLeader() runs and finishes cleanup, the old task can resume and start leader-only services such as startCQScheduler(), startPipeMetaSync(), startPipeHeartbeat(), or startSubscriptionMetaSync() after the node is no longer leader.

The main startup path is guarded by leaderServicesLock, but these submitted tasks are not serialized with cleanup. Please either run the task under leaderServicesLock and re-check isCurrentLeaderServicesEpoch(epoch) inside the lock, or pass a cancellation/epoch guard into the individual service start paths.
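
A minimal sketch of the first option, running the task under the lock and re-checking the epoch inside it. `EpochGuardSketch`, `runIfEpochCurrent`, and `invalidate` are hypothetical names for illustration, not the methods in `ConfigRegionStateMachine`.

```java
/** Hypothetical sketch: the epoch check and the task run are atomic with respect to cleanup. */
final class EpochGuardSketch {
  private final Object leaderServicesLock = new Object();
  private long epoch; // guarded by leaderServicesLock

  long currentEpoch() {
    synchronized (leaderServicesLock) {
      return epoch;
    }
  }

  /** Stand-in for step-down: advancing the epoch makes every earlier epoch stale. */
  void invalidate() {
    synchronized (leaderServicesLock) {
      epoch++;
    }
  }

  /** Runs the task only if the epoch is still current; returns whether it ran. */
  boolean runIfEpochCurrent(long expectedEpoch, Runnable task) {
    synchronized (leaderServicesLock) {
      if (epoch != expectedEpoch) {
        return false; // leadership changed since the task was scheduled
      }
      // Still inside the lock, so invalidate() cannot interleave with the task.
      task.run();
      return true;
    }
  }
}
```

The trade-off of this option is that a step-down has to wait for any task already running under the lock, so slow service startups delay cleanup.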

@CRZbulabula

Contributor Author

@Caideyipi Good catch, thanks. Fixed in e074337 by reworking the leader-services lifecycle so this race can no longer happen.

The root cause was that submitIfLeaderServicesEpochCurrent() only checked the epoch before task.run(), and those submitted tasks were not serialized against notifyNotLeader()'s cleanup. I removed that helper entirely. The new design:

  1. All transitions are serialized on a single-thread executor. notifyLeaderReady (become-leader), notifyNotLeader / notifyLeaderChanged (resign) all submit to one single-thread leaderServicesTransitionExecutor. Because it has exactly one worker, a become-leader orchestration and a resign cleanup can never run concurrently — one runs to completion before the other starts. So startCQScheduler() / startPipeMetaSync() / startPipeHeartbeat() / startSubscriptionMetaSync() can no longer interleave with cleanup.

  2. The epoch is bumped eagerly on resign, before cleanup is even queued. notifyNotLeader calls invalidateLeaderServices() synchronously on the consensus thread, so the epoch advances the instant we lose leadership. An in-flight becomeLeader re-checks isCurrentLeaderServicesEpoch(epoch) after the parallel startups join and again before it sets leaderServicesReady = true, so a stale epoch bails out and never re-enables services after cleanup.

  3. leaderServicesReady is only set inside leaderServicesLock with the epoch re-checked, so the "set ready" step is atomic with respect to the epoch.

Within a single become-leader epoch, load services still start first (for warm-up), then the remaining independent services start in parallel on a cached pool and are joined before the epoch is marked ready. So the check-then-run gap you pointed out is closed both by the single-thread serialization and by the epoch re-check inside the lock.
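
The design above can be sketched roughly as follows. This is a simplified illustration, not the merged code: the class and method bodies are assumptions, the parallel start of independent services on a cached pool is omitted, and `awaitIdle` is a hypothetical helper added only so the sketch can be exercised.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch: serialized transitions plus an eagerly bumped epoch. */
final class LeaderTransitionSketch {
  private final ExecutorService transitionExecutor = Executors.newSingleThreadExecutor();
  private final Object leaderServicesLock = new Object();
  private long epoch; // guarded by leaderServicesLock
  private boolean leaderServicesReady; // guarded by leaderServicesLock

  /** Become leader: open a new epoch, then start services on the transition thread. */
  void notifyLeaderReady(Runnable startServices) {
    final long myEpoch;
    synchronized (leaderServicesLock) {
      myEpoch = ++epoch;
    }
    transitionExecutor.execute(() -> {
      startServices.run();
      synchronized (leaderServicesLock) {
        // Re-check inside the lock: a step-down during startup makes this epoch stale.
        if (epoch == myEpoch) {
          leaderServicesReady = true;
        }
      }
    });
  }

  /** Resign: invalidate the epoch on the caller's thread, then queue the cleanup. */
  void notifyNotLeader(Runnable cleanup) {
    synchronized (leaderServicesLock) {
      epoch++;
      leaderServicesReady = false;
    }
    transitionExecutor.execute(cleanup);
  }

  boolean isLeaderServicesReady() {
    synchronized (leaderServicesLock) {
      return leaderServicesReady;
    }
  }

  /** Blocks until every transition queued before this call has finished. */
  void awaitIdle() {
    try {
      transitionExecutor.submit(() -> {}).get();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e);
    }
  }

  void shutdown() {
    transitionExecutor.shutdown();
  }
}
```

Because the executor has a single worker, a queued cleanup cannot start until the in-flight startup returns, and the stale startup never flips the ready flag because its epoch no longer matches.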

Refactor ConfigRegionStateMachine so leader become/resign transitions are
strictly serial. All transitions (notifyLeaderReady / notifyNotLeader /
notifyLeaderChanged) are submitted to a single-thread transition executor,
which is the barrier that keeps epochs serial: one transition's orchestration
runs to completion before the next begins.

Within a become-leader epoch, load services start first to warm up as early
as possible, then the remaining independent leader services start in parallel
on a cached pool and are joined before the epoch is marked ready. The epoch is
bumped eagerly on resign so an in-flight startup detects it is stale and bails
out before re-enabling services after cleanup.

This removes the giant lock-wrapped startLeaderServices method and the
per-task submitIfLeaderServicesEpochCurrent helper; leaderServicesLock now only
guards the (epoch, ready) pair.
@CRZbulabula CRZbulabula force-pushed the improve-confignode-leader-confirm branch from 7408a91 to 5c69640 Compare June 9, 2026 11:31
@sonarqubecloud

sonarqubecloud Bot commented Jun 9, 2026

@CRZbulabula CRZbulabula merged commit ddd8faa into master Jun 10, 2026
45 checks passed
@CRZbulabula CRZbulabula deleted the improve-confignode-leader-confirm branch June 10, 2026 06:31